Background

CarPrices is a data set containing 80 samples of Cadillac cars. Here we look at Price as a function of Mileage for the Deville model, and then at how strongly the Deville's Trim style influences the price.

Model

Below is the simple linear regression model, with Price as the response \(Y_i\) and Mileage as the single explanatory variable \(X_i\):

\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)

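# Packages used below: plotly (plot_ly, add_trace), car (qqPlot), dplyr (mutate, case_when)
library(plotly)
library(car)
library(dplyr)

# Assumes 'Deville' is a data frame holding the Deville rows of CarPrices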

deville_lm <- lm(Price ~ Mileage, data = Deville)

b <- coef(deville_lm)

p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers", 
             color = ~Trim,
             text = ~paste("Trim: ", Trim)) %>%
  layout(title = "Price vs Mileage",
         xaxis = list(title = "Mileage"),
         yaxis = list(title = "Price"))

# Add regression line to the plot
p <- add_trace(p, x = Deville$Mileage, y = b[1] + b[2]*Deville$Mileage, 
               type = "scatter", mode = "lines", line = list(color = "green"))

# Print the plot
p
# Set up a 1x3 grid for the diagnostic plots
par(mfrow=c(1,3))
plot(deville_lm, which=1)
qqPlot(deville_lm$residuals, id=FALSE)
plot(deville_lm$residuals)

summary(deville_lm)
## 
## Call:
## lm(formula = Price ~ Mileage, data = Deville)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4296  -2986   1027   1881   3870 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41106.3054  1295.4696  31.731  < 2e-16 ***
## Mileage        -0.2461     0.0607  -4.055 0.000362 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2667 on 28 degrees of freedom
## Multiple R-squared:   0.37,  Adjusted R-squared:  0.3475 
## F-statistic: 16.44 on 1 and 28 DF,  p-value: 0.0003624

Adjusted R-squared: 0.3475
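
As a worked reading of the summary above, the fitted line is roughly Price = 41106.31 - 0.2461 × Mileage. The sketch below hard-codes those two estimates instead of refitting the model. The helper `predict_price_mileage` and the 20,000-mile and 30,000-mile inputs are hypothetical and are not part of the original analysis.

```r
# Coefficient estimates copied from summary(deville_lm) above
b0 <- 41106.3054   # intercept: estimated price at 0 miles
b1 <- -0.2461      # slope: estimated change in price per additional mile

# Hypothetical helper (an assumption for illustration, not in the original code)
predict_price_mileage <- function(mileage) b0 + b1 * mileage

# Estimated price at a hypothetical 20,000 miles
predict_price_mileage(20000)

# Estimated change in price over an extra 10,000 miles, i.e. 10000 * b1
predict_price_mileage(30000) - predict_price_mileage(20000)
```

With an adjusted R-squared of 0.3475, mileage alone leaves most of the variation in price unexplained, so these point estimates should be read loosely.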

Revised Model

Below is the equation for a two-lines linear regression model, where \(X_{1i}\) is Mileage and \(X_{2i}\) is an indicator for Trim:

\(\underbrace{Y_i}_\text{Price} = \underbrace{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}}_{E(Y_i)} + \epsilon_i\)

\[ X_{2i} = \begin{cases} 1 & \text{if Trim = DHS Sedan 4D or DTS Sedan 4D} \\ 0 & \text{if Trim = Sedan 4D} \end{cases} \]
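
Substituting the two possible values of \(X_{2i}\) shows where the two lines come from. Because this equation has no change-in-slope term, the lines share the slope \(\beta_1\) and differ only in their intercepts:

\[ E(Y_i) = \begin{cases} (\beta_0 + \beta_2) + \beta_1 X_{1i} & \text{if } X_{2i} = 1 \\ \beta_0 + \beta_1 X_{1i} & \text{if } X_{2i} = 0 \end{cases} \]

So \(\beta_2\) is the vertical gap between the two lines at any mileage.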

# Create the scatter plot using plotly, coloring points based on the "Trim" variable


p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers", 
        color = ~Trim,
        text = ~paste("Trim: ", Trim)) %>%
  layout(title = "Price vs Mileage",
         xaxis = list(title = "Mileage"),
         yaxis = list(title = "Price"))

b <- coef(deville_lm)

p <- add_trace(p, x = Deville$Mileage, y = b[1] + b[2]*Deville$Mileage, 
               type = "scatter", mode = "lines", line = list(color = "skyblue"))

p
# Goal: find a variable for the Cadillac Deville that gives a better fit when added to the model.
# Try two lines and see which variable splits the values; compare the R-squared values.
# Start with mileage only.
# When the right variable is added, the R-squared jumps.

The data are split by Trim. The model copies the two-lines equation except for its last term, the change in slope, so the two lines share a slope and differ only in intercept. The revised model is fitted below, followed by the plot of the new lines and three diagnostic graphs.
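
To illustrate the R-squared jump described in these notes, here is a minimal sketch on simulated data. None of the numbers come from CarPrices. The data frame `sim`, the 0/1 variable `group`, and the chosen slopes and noise level are all assumptions made for illustration.

```r
set.seed(1)  # simulated data only; these are not the CarPrices values

n <- 30
mileage <- runif(n, 5000, 40000)
group   <- rep(c(0, 1), length.out = n)   # hypothetical 0/1 indicator
price   <- 39000 - 0.3 * mileage + 5000 * group + rnorm(n, sd = 750)
sim     <- data.frame(price, mileage, group)

one_line  <- lm(price ~ mileage, data = sim)          # mileage only
two_lines <- lm(price ~ mileage + group, data = sim)  # adds the indicator

# Adjusted R-squared before and after adding the indicator
c(one_line  = summary(one_line)$adj.r.squared,
  two_lines = summary(two_lines)$adj.r.squared)

# Nested-model F test for whether the indicator improves the fit
anova(one_line, two_lines)
```

Comparing adjusted R-squared rather than multiple R-squared is the safer choice here, since multiple R-squared never decreases when a term is added.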

Deville <- Deville %>%
  mutate(
    Trim_Case = case_when(
      Trim %in% c("DHS Sedan 4D", "DTS Sedan 4D") ~ 1,
      Trim == "Sedan 4D" ~ 0
    )
  )


# Fit linear models
lm_trim <- lm(Price ~ Mileage + Trim_Case, data = Deville)

# Obtain fitted values (not used below)
# fitted_values <- predict(lm_trim)

# Get coefficients
bd <- coef(lm_trim) 

# Create scatter plot
p <- plot_ly(data = Deville, x = ~Mileage, y = ~Price, type = "scatter", mode = "markers", 
             color = ~Trim,
             text = ~paste("Trim: ", Trim)) %>%
  layout(title = "Price vs Mileage",
         xaxis = list(title = "Mileage"),
         yaxis = list(title = "Price"))

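# Add regression lines: both share the slope bd[2] and differ only in intercept
# Sedan 4D (Trim_Case = 0): intercept bd[1]
p <- add_trace(p, x = Deville$Mileage, y = bd[1] + bd[2]*Deville$Mileage, 
               type = "scatter", mode = "lines", line = list(color = "lightblue"))

# DHS Sedan 4D or DTS Sedan 4D (Trim_Case = 1): intercept bd[1] + bd[3]
p <- add_trace(p, x = Deville$Mileage, y = (bd[1] + bd[3]) + bd[2]*Deville$Mileage, 
               type = "scatter", mode = "lines", line = list(color = "pink"))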

p  # Print the plot
par(mfrow=c(1,3))
plot(lm_trim, which=1)
qqPlot(lm_trim, id=FALSE)
plot(lm_trim$residuals)

summary(lm_trim)
## 
## Call:
## lm(formula = Price ~ Mileage + Trim_Case, data = Deville)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1612.02  -220.33    59.19   380.80  1771.88 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.871e+04  3.895e+02   99.38  < 2e-16 ***
## Mileage     -3.054e-01  1.747e-02  -17.48 2.99e-16 ***
## Trim_Case    5.347e+03  2.973e+02   17.99  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 753.8 on 27 degrees of freedom
## Multiple R-squared:  0.9515, Adjusted R-squared:  0.9479 
## F-statistic: 264.7 on 2 and 27 DF,  p-value: < 2.2e-16

Adjusted R-squared: 0.9479 (multiple R-squared: 0.9515), up from 0.3475 for the mileage-only model.
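
As a worked reading of the revised summary, the sketch below plugs the printed estimates into the two-lines equation. The coefficients are copied as rounded in the output above, so the results are approximate. The helper `predict_price_trim` and the 20,000-mile input are hypothetical and are not part of the original analysis.

```r
# Estimates copied (as rounded) from summary(lm_trim) above
b0 <- 38710     # intercept for Sedan 4D (Trim_Case = 0)
b1 <- -0.3054   # common slope: change in price per additional mile
b2 <- 5347      # intercept shift for DHS/DTS Sedan 4D (Trim_Case = 1)

# Hypothetical helper (an assumption for illustration, not in the original code)
predict_price_trim <- function(mileage, trim_case) {
  b0 + b1 * mileage + b2 * trim_case
}

base_trim  <- predict_price_trim(20000, 0)   # Sedan 4D at a hypothetical 20,000 miles
upper_trim <- predict_price_trim(20000, 1)   # DHS/DTS Sedan 4D at the same mileage

c(base_trim = base_trim, upper_trim = upper_trim)

# The gap equals b2 at any mileage because the two lines are parallel
upper_trim - base_trim
```

At the same mileage, the model estimates a DHS or DTS Sedan 4D to be priced about 5,347 higher than a Sedan 4D.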